Note: Initial data sets were cleaned and merged prior to this analysis. The original data sets’ structure and cleaning steps may be found in the file us_state_death_trends_wrangling.
This report explores the Centers for Disease Control and Prevention’s (CDC) Weekly Morbidity and Mortality data from 2014 through the 2nd quarter of 2021. The analysis focuses on national-level weekly death counts for specific causes in the United States. A 2nd report will analyze specific states and local regions. Reporting is voluntary and is not guaranteed to be timely or regular. Therefore, the most recent quarter of data may be incomplete and unreliable for analysis and insights.
More information on the original data sets can be found at CDC’s website using the following links:
Note: The most current data can be downloaded using the links above. To use the data sets without any code modifications, the user will need to:
# Create United States subset
us_deaths_df <- mmwr_1421_df[Location == "United States", ]
# Dropping unused levels (only United States occurs).
us_deaths_df$Location <- droplevels(us_deaths_df$Location)
# Plot each cause in the months around 2020-01-01 (dotted line), one panel per cause.
melt(us_deaths_df[Week_End_Date > as.Date("2019-09-01") &
                    Week_End_Date < as.Date("2020-03-31"),
                  Week_End_Date:Abnormal_Finding],
     id.vars = "Week_End_Date") %>%
  ggplot(aes(x = Week_End_Date, y = value, group = variable)) +
  geom_point() +
  geom_vline(xintercept = as.Date("2020-01-01"), linetype = "dotted") +
  facet_wrap(~ variable, scales = "free_y") +
  theme_bw()
There are vertical gaps (jumps) at the intersection of the two data sets for:
- Influenza and Pneumonia
- Other Respiratory
- Abnormal Finding
After the jumps, the data settles into a pattern consistent with the trend of the previous years’ data. These gaps are likely related to Covid-19 cases that went undiagnosed because Covid-19 testing was not available until March of 2020 and was not widely available (without restricted use) until May of 2020. In addition, there was no Covid-19 cause-of-death code available in the United States on January 1st of 2020, so any such deaths would have been recorded under another general respiratory category such as these three.
In addition to the gaps, there are notable peaks at the merge point (date) of the two data sets. The series are generally smooth before and after these peaks, which suggests a local anomaly within an otherwise consistent long-term pattern.
The analysis will consist mostly of averages and year-to-year comparisons of descriptive statistics. Therefore, gaps and short-lived shifts lasting 1-2 weeks will not materially affect the analysis.
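The arithmetic behind this claim can be sketched with made-up numbers. The baseline of 1,500 deaths per week and the jump of 300 deaths are illustrative assumptions, not values taken from the CDC data.

```r
# Hypothetical series: a flat baseline of 1500 deaths per week for 52 weeks.
baseline <- rep(1500, 52)

# Same series with an assumed 2-week jump of 300 extra deaths per week.
with_jump <- baseline
with_jump[1:2] <- with_jump[1:2] + 300

# Shift in the yearly average caused by the 2-week jump.
shift <- mean(with_jump) - mean(baseline)   # 600 / 52 deaths per week
pct   <- 100 * shift / mean(baseline)       # shift as a percent of the baseline

print(round(c(shift = shift, pct = pct), 2))
```

Under these assumptions, a jump of 20% in two individual weeks moves the yearly average by less than 1%.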
head(us_deaths_df, 3)
tail(us_deaths_df, 3)
str(us_deaths_df)
## Classes 'data.table' and 'data.frame': 398 obs. of 18 variables:
## $ Location : Factor w/ 1 level "United States": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ Week : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Week_End_Date : Date, format: "2014-01-04" "2014-01-11" ...
## $ Natural : int 50189 52450 51043 50560 50402 49790 50175 49010 47907 48353 ...
## $ Heart : int 13166 13663 12928 12813 12896 12681 12984 12577 12248 12318 ...
## $ Cancer : int 11244 11504 11496 11629 11584 11355 11477 11478 11251 11535 ...
## $ Lower_Respiratory : int 3331 3444 3333 3467 3283 3351 3303 3047 3008 3043 ...
## $ Brain : int 2669 2738 2714 2720 2699 2684 2669 2799 2630 2529 ...
## $ Alzheimer : int 1780 1917 1914 1862 1867 1873 1843 1814 1776 1830 ...
## $ Diabetes : int 1654 1735 1660 1602 1586 1643 1642 1564 1588 1536 ...
## $ Covid_19_Multi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Covid_19 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Influenza_Pneumonia: int 1639 1910 1920 1765 1642 1528 1472 1269 1228 1215 ...
## $ Kidney : int 965 1098 1056 1029 998 1038 1021 973 1018 1040 ...
## $ Other_Respiratory : int 756 845 812 753 720 728 739 731 687 760 ...
## $ Septicemia : int 882 905 919 845 890 849 851 708 779 777 ...
## $ Abnormal_Finding : int 679 665 598 622 664 641 638 643 595 642 ...
## - attr(*, ".internal.selfref")=<externalptr>
After preprocessing and cleaning, the United States (U.S.) subset used in this analysis contains 398 observations and 18 variables: a categorical location, year, week-of-year, and week-ending date fields, and integer weekly death counts by cause. The earliest observation is the week ending January 4, 2014. The most recent, as of this analysis, is the week ending August 14, 2021. Summary statistics for the U.S. subset follow.
describe(us_deaths_df[, Location:Week_End_Date])
## us_deaths_df[, Location:Week_End_Date]
##
## 4 Variables 398 Observations
## --------------------------------------------------------------------------------
## Location
## n missing distinct value
## 398 0 1 United States
##
## Value United States
## Frequency 398
## Proportion 1
## --------------------------------------------------------------------------------
## Year
## n missing distinct Info Mean Gmd
## 398 0 8 0.984 2017 2.537
##
## lowest : 2014 2015 2016 2017 2018, highest: 2017 2018 2019 2020 2021
##
## Value 2014 2015 2016 2017 2018 2019 2020 2021
## Frequency 53 52 52 52 52 52 53 32
## Proportion 0.133 0.131 0.131 0.131 0.131 0.131 0.133 0.080
## --------------------------------------------------------------------------------
## Week
## n missing distinct Info Mean Gmd .05 .10
## 398 0 53 1 25.83 17.31 3.00 5.70
## .25 .50 .75 .90 .95
## 13.00 25.00 38.75 47.00 50.00
##
## lowest : 1 2 3 4 5, highest: 49 50 51 52 53
## --------------------------------------------------------------------------------
## Week_End_Date
## n missing distinct Info Mean Gmd .05
## 398 0 398 1 2017-10-24 931 2014-05-22
## .10 .25 .50 .75 .90 .95
## 2014-10-08 2015-11-29 2017-10-24 2019-09-19 2020-11-09 2021-03-28
##
## lowest : 2014-01-04 2014-01-11 2014-01-18 2014-01-25 2014-02-01
## highest: 2021-07-17 2021-07-24 2021-07-31 2021-08-07 2021-08-14
## --------------------------------------------------------------------------------
There are 398 observations, running from the week ending 01-04-2014 to the week ending 08-14-2021. The years 2014 and 2020 each have a week number 53. This is not only a leap-year effect: a calendar year is one day (two in a leap year) longer than 52 full weeks, so the extra days periodically accumulate into a 53rd reporting week that straddles the new year. This 53rd week will not significantly affect the results because the analysis uses year-long and quarterly averages. Year 2021 has only 32 weeks, so any comparison after week 32 can only show 2020 against previous years. After further review of the data, analysis will be limited to the end of the 2nd quarter of 2021 (see “Examining possible reporting delays in last quarter of data” below).
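The week counts per year can be reproduced from the week-ending dates alone. The sketch below assumes the usual MMWR convention that weeks run Sunday through Saturday and belong to the year containing at least four of their days, which is equivalent to taking the year of the Wednesday three days before the week-ending Saturday. The object names are illustrative and do not come from the analysis code.

```r
# Saturdays from the first to the last week-ending date in the data set.
week_end <- seq(as.Date("2014-01-04"), as.Date("2021-08-14"), by = "week")

# Assumed MMWR rule: a week belongs to the year of its Wednesday
# (the week-ending Saturday minus 3 days).
mmwr_year <- as.integer(format(week_end - 3, "%Y"))

# Number of reporting weeks assigned to each year.
weeks_per_year <- table(mmwr_year)
print(weeks_per_year)
```

Under this rule the weeks ending 2015-01-03 and 2021-01-02 are assigned to 2014 and 2020, which is where the two 53rd weeks come from.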
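# psych::describe() returns the same table as describeBy() with no grouping variable;
# the psych:: prefix avoids a name clash with Hmisc::describe().
as.data.frame(psych::describe(us_deaths_df[ , Natural:Abnormal_Finding]))
as.data.frame(psych::describe(us_deaths_df[Year > 2019,
                                           .(Covid_19_Multi, Covid_19)]))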
us_deaths_df[ , Natural:Abnormal_Finding]%>%
hist()
Most of the distributions above are roughly normal, many with right skew. The Covid categories are affected by the zeros from the years before 2020, when there was no Covid-19 diagnosis. These zeros will be removed by limiting the data to 2020 and later. The cancer and abnormal findings categories will benefit from a larger number of (narrower) bins.
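How much the pre-2020 zeros distort a summary can be seen in a toy example. All counts below are invented for illustration; only the filter `year > 2019` mirrors the restriction used in this report.

```r
# Hypothetical weekly Covid-19 counts: six pre-2020 weeks (all zero)
# followed by four weeks from 2020 onward.
year  <- c(2018, 2018, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2021)
covid <- c(0, 0, 0, 0, 0, 0, 1200, 5400, 9800, 7600)

# The mean over all weeks is pulled toward zero by the years with no diagnosis code.
mean_all <- mean(covid)

# Restricting to 2020 and later describes only weeks where the category could occur.
mean_2020_on <- mean(covid[year > 2019])

print(c(all_weeks = mean_all, from_2020 = mean_2020_on))
```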
par(mfrow=c(2,2))
hist(us_deaths_df[ , Cancer], breaks = 32,
main = "Cancer",
xlab = "Weekly Cancer Deaths (bins=32)")
hist(us_deaths_df[ Year > 2019, Covid_19_Multi], breaks = 14,
main = "Covid-19 Comorbidity",
xlab = "Weekly Covid-19 Comorbidity Deaths (bins=14)")
hist(us_deaths_df[ Year > 2019, Covid_19], breaks = 14,
main = "Covid-19 Singular Cause",
xlab = "Weekly Covid-19 Deaths (bins=14)")
hist(us_deaths_df[ , Abnormal_Finding], breaks = 51,
main = "Abnormal Finding",
xlab = "Weekly Abnormal Finding Deaths (bins=51)")
par(mfrow=c(1,1))
The cancer and abnormal findings categories show normal-like distributions, with cancer having possible outliers to the left and abnormal findings having significant outliers to the right. Outliers are expected for abnormal findings because the category itself collects cases that fit no other diagnosis. Cancer deaths, by contrast, would be expected to be consistent from week to week, with no week significantly different from the others. Therefore, cancer requires further analysis.
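One caveat about the bin counts in the axis labels: when `breaks` is a single number, base R's `hist()` treats it as a suggestion and picks "pretty" break points, so the number of bins actually drawn may differ. The sketch below uses simulated data (not the CDC counts) and shows one way to force an exact bin count by passing explicit break points.

```r
set.seed(1)
# Simulated weekly counts standing in for a real column such as Cancer.
x <- rnorm(398, mean = 11500, sd = 300)

# A single number is only a suggestion; hist() chooses "pretty" break points.
h_suggested <- hist(x, breaks = 32, plot = FALSE)
n_suggested <- length(h_suggested$counts)

# Explicit, equally spaced break points give exactly 32 bins.
edges   <- seq(floor(min(x)), ceiling(max(x)), length.out = 32 + 1)
h_exact <- hist(x, breaks = edges, plot = FALSE)
n_exact <- length(h_exact$counts)

print(c(suggested = n_suggested, exact = n_exact))
```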
plot(us_deaths_df$Week_End_Date, us_deaths_df$Cancer)
The outliers occur during the last 3 weeks of the data and possibly extend further back. This is likely due to reporting delays, which would affect all categories.
melt(us_deaths_df[ , Week_End_Date:Abnormal_Finding],
     id.vars = "Week_End_Date") %>%
  ggplot(aes(x = Week_End_Date, y = value, group = variable)) +
  geom_point() +
  geom_vline(xintercept = as.Date("2021-07-01"), linetype = "solid", color = "blue") +
  facet_wrap(~ variable, scales = "free_y") +
  theme_bw()
All categories except abnormal findings and the Covid-19 related categories show the same pattern of outliers at the end of the series, likely due to delayed reporting. Some categories, such as Alzheimer’s, show possible delayed reporting since the end of the 2nd quarter (blue line at 2021-07-01). To prevent possible errors, the analysis will be limited to data through the 2nd quarter of 2021 (2021-06-30).
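The visual check above could be complemented by a simple numeric flag. The following is a minimal sketch on simulated data, not the method used in this report: it marks weeks whose count falls below 90% of the median of a reference period. The 90% threshold, the 52-week reference window, and the simulated counts are all assumptions chosen for illustration.

```r
set.seed(42)
# Simulated weekly counts for one cause: 60 weeks around 11,500 deaths.
weekly <- round(rnorm(60, mean = 11500, sd = 150))

# Simulate under-reporting in the last 3 weeks.
weekly[58:60] <- c(9800, 7600, 4200)

# Reference level: median of the first 52 weeks, assumed fully reported.
reference <- median(weekly[1:52])

# Flag weeks below an assumed threshold of 90% of the reference level.
threshold     <- 0.9 * reference
suspect_weeks <- which(weekly < threshold)

print(suspect_weeks)
```

A rule like this would need a seasonally adjusted reference for causes with strong winter peaks, such as Influenza and Pneumonia.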
us_deaths_df[ , Natural:Abnormal_Finding]%>%
hist()
par(mfrow=c(2,2))
hist(us_deaths_df[ , Cancer], breaks = 32,
main = "Cancer",
xlab = "Weekly Cancer Deaths (bins=32)")
hist(us_deaths_df[ Year > 2019, Covid_19_Multi], breaks = 14,
main = "Covid-19 Comorbidity",
xlab = "Weekly Covid-19 Comorbidity Deaths (bins=14)")
hist(us_deaths_df[ Year > 2019, Covid_19], breaks = 14,
main = "Covid-19 Singular Cause",
xlab = "Weekly Covid-19 Deaths (bins=14)")
hist(us_deaths_df[ , Abnormal_Finding], breaks = 51,
main = "Abnormal Finding",
xlab = "Weekly Abnormal Finding Deaths (bins=51)")
par(mfrow=c(1,1))
us_deaths_df <- us_deaths_df[Week_End_Date < as.Date("2021-07-01")]
melt(us_deaths_df[ , Week_End_Date:Abnormal_Finding],
     id.vars = "Week_End_Date") %>%
  ggplot(aes(x = Week_End_Date, y = value, group = variable)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_y") +
  theme_bw()
psych::pairs.panels(us_deaths_df[ , Natural:Abnormal_Finding], scale = TRUE)
psych::pairs.panels(us_deaths_df[ Year > 2019 , Natural:Abnormal_Finding], scale = TRUE)
psych::pairs.panels(us_deaths_df[ Year > 2019,
.(Natural, Heart, Brain, Alzheimer, Diabetes,
Covid_19_Multi, Covid_19)],
scale = TRUE)